fix(azure): prepend bucket prefix to all Azure Blob storage paths #14561
Ricardo-M-L wants to merge 3 commits into
Conversation
The Azure SPN (`azure_spn_conn.py`) and Azure SAS (`azure_sas_conn.py`) storage backends ignored the `bucket` parameter and stored every file flat under the container using only `fnm`. As a result, files with the same name uploaded from different datasets (different `kb_id`s) silently overwrote each other in Azure Blob storage.

The MinIO and S3 backends already prepend the bucket as a path prefix (`<bucket>/<fnm>`) for logical isolation. This change applies the same pattern to both Azure backends in `put`, `get`, `rm`, `obj_exist`, `get_presigned_url`, and the `health` probe so all operations use consistent paths.

Adds unit tests under `test/unit_test/rag/utils/` covering each method and a regression case verifying that two datasets uploading the same filename produce two distinct storage paths.

Fixes infiniflow#14159

Signed-off-by: Ricardo-M-L <ricardoporsche001@icloud.com>
📝 Walkthrough

Azure SPN and SAS connectors now consistently prefix the bucket to filenames (`{bucket}/{fnm}`) across `health`, `put`, `get`, `rm`, `obj_exist`, and `get_presigned_url`. Unit tests with stubs verify the prefixing and that identical filenames in different buckets do not collide.

Changes: Azure Blob Path Prefix Alignment
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rag/utils/azure_spn_conn.py (1)
71-82: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix premature exit in retry loop in `put`.

Line 81 returns from inside `except`, so the `for _ in range(3)` loop never retries. This nullifies transient-failure recovery on writes.

Proposed fix:

```diff
 def put(self, bucket, fnm, binary, tenant_id=None):
     for _ in range(3):
         try:
             f = self.conn.create_file(f"{bucket}/{fnm}")
             f.append_data(binary, offset=0, length=len(binary))
             return f.flush_data(len(binary))
         except Exception:
             logging.exception(f"Fail put {bucket}/{fnm}")
             self.__open__()
             time.sleep(1)
-            return None
     return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/utils/azure_spn_conn.py` around lines 71 - 82, The put method's retry loop is broken because the except block returns immediately, preventing further retries; update put (method put in class using self.conn.create_file and f.append_data/f.flush_data) to remove the premature return inside the except so that on exception it logs the error, calls self.__open__(), sleeps, and then continues the for _ in range(3) loop to retry; after the loop finishes without success, return None (preserve existing successful return of f.flush_data when no exception).
🧹 Nitpick comments (1)
test/unit_test/rag/utils/test_azure_blob_bucket_prefix.py (1)
109-186: ⚡ Quick win

Add `health()` coverage since health path construction changed in this PR.

Current tests validate put/get/rm/obj_exist/get_presigned_url but not `health()` for either connector. A small assertion on the called path/name would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/unit_test/rag/utils/test_azure_blob_bucket_prefix.py` around lines 109 - 186, Add a new test in both TestAzureSpnBucketPrefix and TestAzureSasBucketPrefix that calls the connector health() (on the RAGFlowAzureSpnBlob and RAGFlowAzureSasBlob instances) with a bucket like "kb_a" and asserts the underlying mocked connection was invoked with a path containing the bucket prefix; for the SPN test call spn.health("kb_a") and assert any recorded call in spn.conn.call_args_list has a first positional arg that starts with "kb_a/" (or contains "kb_a"), and for the SAS test call sas.health("kb_a") and assert any sas.conn.upload/download/delete/get call_args/kwargs include a "name" or positional arg that starts with "kb_a/". Ensure you reference the instance names spn and sas, the health() method, and the conn mock when adding the assertions.
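A minimal sketch of such a test, assuming `health(bucket)` probes via `create_file` and that bypassing `__init__` with `__new__` plus a mocked `conn` is enough to exercise it (the actual test file's stubbing may differ):

```python
from unittest.mock import MagicMock

from rag.utils.azure_spn_conn import RAGFlowAzureSpnBlob


def test_spn_health_uses_bucket_prefix():
    # Bypass __init__ so no Azure credentials or live connection are needed.
    spn = RAGFlowAzureSpnBlob.__new__(RAGFlowAzureSpnBlob)
    spn.conn = MagicMock()

    spn.health("kb_a")

    # The probe should hand the SDK a path carrying the bucket prefix.
    (path,), _kwargs = spn.conn.create_file.call_args
    assert path.startswith("kb_a/")
```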
ℹ️ Review info

⚙️ Run configuration: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: f0b953d4-2300-4db3-b28e-23038f1208e4

📒 Files selected for processing (3)

- rag/utils/azure_sas_conn.py
- rag/utils/azure_spn_conn.py
- test/unit_test/rag/utils/test_azure_blob_bucket_prefix.py
```diff
 def rm(self, bucket, fnm):
     try:
-        self.conn.delete_blob(fnm)
+        self.conn.delete_blob(f"{bucket}/{fnm}")
```
Mirror a legacy-path fallback here as well for upgrade safety.
Like the SPN connector, non-write methods now only address "{bucket}/{fnm}". Without fallback to legacy fnm, pre-existing blobs written before this PR can become inaccessible.
Also applies to: 72-72, 82-82, 90-90
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_sas_conn.py` at line 65, The delete/get/exists/read methods
currently call self.conn using the new path format f"{bucket}/{fnm}" (e.g., the
self.conn.delete_blob(f"{bucket}/{fnm}") call) but must fall back to the legacy
plain fnm if the new path is not found; update each occurrence (the delete_blob,
get_blob/read operations and existence checks at the same spots) to first try
the f"{bucket}/{fnm}" call and on a not-found/error response attempt the legacy
fnm value before failing, returning the successful result or logging/raising the
original error only if both attempts fail.
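One possible shape for that fallback in the SAS connector's `get`, assuming `self.conn` is an `azure.storage.blob.ContainerClient` and that a missing blob surfaces as `ResourceNotFoundError` (the real method's retry loop and logging are omitted):

```python
from azure.core.exceptions import ResourceNotFoundError


def get(self, bucket, fnm):
    try:
        # New layout: bucket-prefixed blob name.
        return self.conn.download_blob(f"{bucket}/{fnm}").readall()
    except ResourceNotFoundError:
        # Legacy layout: blobs uploaded before this PR were stored flat.
        return self.conn.download_blob(fnm).readall()
```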
```diff
 def obj_exist(self, bucket, fnm):
     try:
-        return self.conn.get_blob_client(fnm).exists()
+        return self.conn.get_blob_client(f"{bucket}/{fnm}").exists()
     except Exception:
         logging.exception(f"Fail put {bucket}/{fnm}")
```
Fix misleading log verb in obj_exist exception path.
obj_exist currently logs "Fail put ...". This should be existence-specific to keep storage diagnostics accurate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_sas_conn.py` around lines 80 - 84, The obj_exist method logs
"Fail put ..." on exceptions which is misleading; change the logging.exception
call inside obj_exist to an existence-specific message like "Fail checking
existence of {bucket}/{fnm}" (include the bucket/fnm) and keep logging.exception
so the exception details are recorded; also ensure obj_exist returns a boolean
on error (e.g., return False) so callers of obj_exist (which calls
self.conn.get_blob_client(...).exists()) receive a consistent result.
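Applied to the current method, the suggestion would look roughly like this (the message wording is illustrative):

```python
import logging


def obj_exist(self, bucket, fnm):
    try:
        return self.conn.get_blob_client(f"{bucket}/{fnm}").exists()
    except Exception:
        # Operation-specific message instead of the copy-pasted "Fail put".
        logging.exception(f"Fail checking existence of {bucket}/{fnm}")
        return False  # consistent boolean result for callers on error
```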
```diff
-        self.conn.delete_file(fnm)
+        self.conn.delete_file(f"{bucket}/{fnm}")
     except Exception:
```
Add legacy-path fallback for non-write operations to avoid post-deploy data invisibility.
After this change, get/rm/obj_exist/get_presigned_url only target "{bucket}/{fnm}". Any objects previously written as plain fnm become unreachable after rollout. Please add a fallback read/delete/existence/presign attempt to legacy fnm (prefixed path first, legacy path second).
Also applies to: 93-94, 104-105, 113-113
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_spn_conn.py` around lines 86 - 87, Modify the non-write
operations to try the new prefixed path first and fall back to the legacy plain
filename on failure: for the delete flow using
self.conn.delete_file(f"{bucket}/{fnm}") (and similarly for
get/get_presigned_url/obj_exist), catch exceptions or a false/None result and
then attempt the same operation with the legacy path self.conn.delete_file(fnm)
(or self.conn.get_object(fnm), self.conn.get_presigned_url(fnm),
self.conn.object_exists(fnm) as appropriate); ensure both attempts preserve
return values/errors so callers receive the correct success/failure outcome and
avoid raising the first-path exception when the legacy attempt succeeds.
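A sketch of the same prefixed-then-legacy pattern for the SPN connector's `get`, assuming `self.conn` is a Data Lake `FileSystemClient` and a missing path raises `ResourceNotFoundError`:

```python
from azure.core.exceptions import ResourceNotFoundError


def get(self, bucket, fnm):
    try:
        # Prefixed path first (new layout).
        client = self.conn.get_file_client(f"{bucket}/{fnm}")
        return client.download_file().readall()
    except ResourceNotFoundError:
        # Fall back to the flat legacy path for pre-upgrade files.
        client = self.conn.get_file_client(fnm)
        return client.download_file().readall()
```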
```diff
 def obj_exist(self, bucket, fnm):
     try:
-        client = self.conn.get_file_client(fnm)
+        client = self.conn.get_file_client(f"{bucket}/{fnm}")
         return client.exists()
     except Exception:
         logging.exception(f"Fail put {bucket}/{fnm}")
```
Correct the exception log message in obj_exist.
The obj_exist error path logs "Fail put ...", which is misleading during incident triage. Use an existence-specific message.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_spn_conn.py` around lines 102 - 107, The obj_exist method
currently logs a misleading message "Fail put ..." on exception; update the
exception logging in obj_exist to use an existence-specific message such as
"Fail checking existence for {bucket}/{fnm}" (use the same f-string pattern) so
logs accurately reflect the operation and include the bucket/fnm details; modify
the logging.exception call inside obj_exist to that new message.
```python
def _install_stubs():
    """Replace heavyweight runtime modules so the connection modules can be
    imported in isolation without the full ragflow runtime or the real
```
Isolate sys.modules stubbing to avoid cross-test pollution.
These fixtures replace global module entries but never restore them. That can make unrelated tests fail depending on run order. Please switch to monkeypatch.setitem (or restore snapshot on teardown) so patches are automatically reverted.
Also applies to: 74-83, 86-97
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/unit_test/rag/utils/test_azure_blob_bucket_prefix.py` around lines 31 -
33, The helper _install_stubs currently mutates sys.modules directly and never
restores entries, causing cross-test pollution; change the stubbing to use
pytest's monkeypatch.setitem to insert the fake modules (e.g., replace
sys.modules["ragflow.runtime"] etc.) so patches are automatically reverted after
each test, and update the three other stub blocks (the ones creating fake
modules around the same area) to use monkeypatch.setitem instead of direct
sys.modules assignment or add explicit teardown that restores originals if
monkeypatch is unavailable; locate uses in function _install_stubs and the other
stub creation sites and replace sys.modules[...] = fake_module with
monkeypatch.setitem(sys.modules, key, fake_module).
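A sketch of the monkeypatch-based variant; the module names here are illustrative, and `monkeypatch.setitem` automatically restores the original `sys.modules` entries after each test:

```python
import sys
import types

import pytest


@pytest.fixture()
def azure_stubs(monkeypatch):
    # Hypothetical stand-ins for the heavyweight modules the tests stub out.
    fake_settings = types.ModuleType("common.settings")
    monkeypatch.setitem(sys.modules, "common.settings", fake_settings)

    fake_azure = types.ModuleType("azure")
    monkeypatch.setitem(sys.modules, "azure", fake_azure)

    return fake_settings, fake_azure
```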
@Ricardo-M-L Would you please fix the unit test?
What

The Azure SPN (`rag/utils/azure_spn_conn.py`) and Azure SAS (`rag/utils/azure_sas_conn.py`) storage backends ignored the `bucket` parameter and wrote every file flat under the container using only `fnm`. Files with the same name uploaded from different datasets (different `kb_id`s) silently overwrote each other in Azure Blob storage.

Why
The MinIO and S3 backends already prepend the bucket as a path prefix (`<bucket>/<fnm>`) for logical isolation between datasets. The Azure backends should follow the same contract.

Change
Applies the `f"{bucket}/{fnm}"` prefix pattern to all path-taking operations in both Azure backends:

| Method | Before | After |
| --- | --- | --- |
| `put` | `create_file(fnm)` / `upload_blob(name=fnm, ...)` | `create_file(f"{bucket}/{fnm}")` / `upload_blob(name=f"{bucket}/{fnm}", ...)` |
| `get` | `get_file_client(fnm)` / `download_blob(fnm)` | `…(f"{bucket}/{fnm}")` |
| `rm` | `delete_file(fnm)` / `delete_blob(fnm)` | `…(f"{bucket}/{fnm}")` |
| `obj_exist` | `get_file_client(fnm)` / `get_blob_client(fnm)` | `…(f"{bucket}/{fnm}")` |
| `get_presigned_url` | `get_presigned_url("GET", bucket, fnm, expires)` | `get_presigned_url("GET", f"{bucket}/{fnm}", expires)` |
| `health` | `create_file(fnm)` / `upload_blob(name=fnm, ...)` | `…(f"{bucket}/{fnm}")` (renamed `_bucket` → `bucket` since it is now used) |
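For context, the whole convention reduces to a single path-join rule; a minimal illustrative sketch, not the actual connector code:

```python
def blob_path(bucket: str, fnm: str) -> str:
    """Logical path used by the MinIO/S3 backends and, after this change,
    by both Azure backends: the dataset id acts as a folder-style prefix."""
    return f"{bucket}/{fnm}"


# Two datasets uploading the same filename no longer collide:
assert blob_path("kb_a", "resume.pdf") != blob_path("kb_b", "resume.pdf")
```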
Tests

Adds `test/unit_test/rag/utils/test_azure_blob_bucket_prefix.py` (12 tests, all passing) covering every modified method on both backends, plus a dedicated regression test that verifies two datasets uploading the same filename produce two distinct storage paths.
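A condensed version of the regression idea, assuming the SPN connector can be instantiated via `__new__` with a mocked `conn` (the actual test file stubs the SDK differently):

```python
from unittest.mock import MagicMock

from rag.utils.azure_spn_conn import RAGFlowAzureSpnBlob


def test_same_filename_in_two_datasets_gets_distinct_paths():
    spn = RAGFlowAzureSpnBlob.__new__(RAGFlowAzureSpnBlob)  # skip real auth
    spn.conn = MagicMock()

    spn.put("kb_a", "doc.pdf", b"a")
    spn.put("kb_b", "doc.pdf", b"b")

    paths = [c.args[0] for c in spn.conn.create_file.call_args_list]
    assert paths == ["kb_a/doc.pdf", "kb_b/doc.pdf"]
```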
The tests stub the `azure` SDK and the `common.settings` module so they run without Azure credentials or live connectivity.

Notes
For `get_presigned_url`, the Azure SPN/SAS implementations were already non-functional (the underlying Azure SDK clients don't expose this method); this change keeps the path argument consistent with the rest of the API surface as requested in the issue, in case the method is reworked later.

Fixes #14159